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A  Parabolic  Load  Balancing  Method  1 


Alan  Heirieh  &:  Stephen  Taylor 
Scalable  Concurrent  Programming  Laboratory 
California  Institute  of  Technology 


Abstract 

This  paper  presents  a  diffusive  load  balancing  method  for  scal¬ 
able  multicomputers.  In  contrast  to  other  schemes  which  are  provabiy 
correct  the  method  scales  to  large  numbers  of  processors  with  no  in¬ 
crease  in  run  time.  In  contrast  to  other  schemes  which  are  scalable  the 
method  is  provabiy  correct  and  the  paper  analyzes  the  rate  of  conver¬ 
gence.  To  control  aggregate  cpu  idle  time  it  can  be  useful  to  balance 
the  load  to  specifiable  accuracy.  The  method  achieves  arbitrary  accu¬ 
racy  by  proper  consideration  of  numerical  error  and  stability. 

This  paper  presents  the  method,  proves  correctness,  convergence 
and  scalability,  and  simulates  applications  to  generic  problems  in  com¬ 
putational  fluid  dynamics  (CFD).  The  applications  reveal  some  useful 
properties.  The  method  can  preserve  adjacency  relationships  among 
elements  of  an  adapting  computational  domain.  This  makes  it  use¬ 
ful  for  partitioning  unstructured  computational  grids  in  concurrent 
computations.  The  method  can  execute  asynchronously  to  balance  a 
subportion  of  a  domain  without  affecting  the  rest  of  the  domain. 

Theory  and  experiment  show  the  method  is  efficient  on  the  scal¬ 
able  multicomputers  of  the  present  and  coming  years.  The  number 
of  floating  point  operations  required  per  processor  to  reduce  a  point 
disturbance  by  90%  is  168  on  a  system  of  512  computers  and  105  on 
a  system  of  1,000,000  computers.  On  a  typical  contemporary  multi¬ 
computer  [19]  this  requires  82.5  ps  of  wall-clock  time. 


xThe  research  described  in  this  report  is  sponsored  primarily  by  the  Advanced  Re¬ 
search  Projects  Agency,  ARPA  Order  number  8176,  and  monitored  by  the  Office  of  Naval 
Research  under  contract  number  N000 14-9 1-J- 1986. 


1  Introduction 


The  scale  of  scientific  applications  and  the  computers  on  which  they  run  is 
growing  rapidly.  Disciplines  such  as  particle  physics  and  computational  fluid 
dynamics  pose  numerous  problems  which  until  recently  have  exceeded  the 
memory  and  cpu  capacities  of  existing  computers.  Moreover  these  disciplines 
appear  ready  to  pose  new  problems  which  exceed  the  capacities  of  the  largest 
computers  that  will  be  built  in  this  decade.  The  Grand  Challenges  represent 
a  set  of  problems  solvable  by  computers  with  TeraFlops  (109)  performance. 
If  present  trends  continue  these  will  be  supplanted  within  a  decade  by  a  set 
of  challenges  requiring  PetaFlops  (1012)  performance. 

This  increase  in  problem  size  is  made  possible  by  the  dramatic  reductions 
in  size  and  cost  of  VLSI  technology  and  parallel  computing.  Currently  our 
research  in  computational  fluid  dynamics  uses  scalable  multicomputers  with 
roughly  500  processors  [19,  23,  24].  Research  efforts  are  underway  to  develop 
within  three  years  time  scalable  multicomputers  with  hundreds  of  thousands 
of  functional  units  [16,  21].  These  trends  suggest  that  limits  to  the  growth 
of  scientific  computing  are  more  likely  to  be  determined  in  the  next  several 
years  by  software  technology  than  by  hardware  limitations. 

A  potentially  limiting  software  technology  is  the  class  of  methods  to  solve 
the  load  balancing  problem.  Most  numerical  algorithms  require  frequent  syn¬ 
chronization.  If  a  load  distribution  on  a  multicomputer  is  uneven  then  some 
processors  will  sit  idle  while  they  wait  for  others  to  reach  common  synchro¬ 
nization  points.  The  amount  of  potential  work  lost  to  idle  time  is  proportional 
to  the  degree  of  imbalance  that  exists  among  the  processor  workloads.  Since 
this  loss  also  increases  with  processor  count  it  can  be  valuable  to  control  the 
accuracy  of  the  resulting  balance  and  to  trade  off  the  quality  of  the  balance 
against  the  cost  of  rebalancing. 

This  paper  presents  a  load  balancing  method  for  mesh  connected  scalable 
multicomputers  with  any  number  of  processors.  The  method  can  balance  the 
load  to  an  arbitrary  degree  of  accuracy.  Theory  and  experiment  show  the 
method  is  inexpensive  under  realistic  conditions.  While  it  is  effective  on  con¬ 
temporary  systems  with  under  1,000  processors  it  is  specifically  intended  to 
scale  to  systems  with  tens  and  hundreds  of  thousands  of  processors.  The 
total  wall  clock  time  for  the  method  decreases  as  the  processor  count  in- 
creases.  The  method  was  developed  to  solve  the  dynamic  (run  time)  load 
balancing  problem.  It  has  also  proven  useful  for  cases  of  static  (initial  load 
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time)  balancing. 

The  method  assumes  the  computation  is  sufficiently  fine  grained  that 
work  can  be  treated  as  a  continuous  quantity.  This  is  reasonable  in  an 
application  domain  like  computational  fluid  dynamics  in  which  units  of  work 
represent  grid  points  in  a  simulation.  Each  processor  is  typically  responsible 
for  thousands  of  grid  points  and  can  exchange  individual  points  with  other 
processors. 

Cybenko  [6]  has  published  a  method  which  is  similar  in  several  respects 
to  this  one.  He  proves  asymptotic  convergence  of  an  iterative  scheme  on 
arbitrary  interconnection  topologies.  Two  other  articles  have  appeared  more 
recently  on  diffusive  load  balancing  methods.  Boillat  [4]  demonstrates  poly¬ 
nomial  convergence  on  arbitrary  interconnection  topologies  using  a  Markov 
analysis.  Horton  objects  to  the  polynomial  convergence  demonstrated  in 
[4]  and  the  lack  of  bounds  on  accuracy  in  the  previous  work.  He  applies 
a  multigrid  concept  [11,  14]  to  accelerate  convergence  of  a  simple  diffusive 
scheme. 

The  method  presented  in  this  paper,  while  developed  independently,  re¬ 
sembles  a  special  case  of  Cybenko’s  method  restricted  to  mesh  connected 
topologies.  It  differs  from  Cybenko’s  method  with  regard  to  issues  of  numer¬ 
ical  stability.  The  objections  in  [11]  lack  rigor.  This  paper  refutes  them  via 
formal  demonstrations  of  convergence  rates,  accuracy,  and  scaling.  These 
results  show  that  convergence  is  rapid,  accuracy  is  limited  only  by  machine 
precision,  and  superlinear  speedup  can  be  achieved  for  cases  of  practical  in¬ 
terest  in  CFD. 


2  A  Parabolic  Model 

A  number  of  articles  have  proposed  solutions  to  the  load  balancing  problem 
in  recent  years  [2,  5,  10,  12,  13,  15,  17,  18,  22].  Many  of  these  solutions  are 
reliable  and  efficient  for  computer  systems  with  small  numbers  of  processors. 
Unfortunately  many  of  them  do  not  scale  well  to  systems  with  large  numbers 
of  processors.  It  is  well  known  that  in  a  scalable  algorithm  the  amount  of  work 
performed  in  parallel  grows  more  rapidly  than  the  amount  of  work  performed 
in  the  serial  part  of  the  calculation  as  the  size  of  the  problem  increases  [1,  9]. 
Scalable  algorithms  tend  to  be  highly  concurrent  and  this  fact  often  makes  it 
difficult  to  prove  that  they  are  correct.  A  load  balancing  method  for  scalable 
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multicomputers  should  be  scalable  but  should  not  sacrifice  reliability. 

It  is  worth  noting  that  a  class  of  random  placement  methods  have  been 
proposed  for  scalable  multicomputers  [2,  10].  These  methods  are  scalable 
and  are  reliable  under  the  assumption  that  disturbances  occur  frequently 
and  have  short  lifespans.  These  assumptions  do  not  hold  in  a  domain  like 
CFD  where  disturbances  arise  occasionally  and  are  long  lasting.  As  a  result 
we  are  unable  to  take  advantage  of  the  rewards  these  methods  offer. 

A  method  is  reliable  if  it  can  be  shown  to  compute  a  correct  solution 
within  a  predictable  number  of  steps.  The  simplest  reliable  load  balancing 
method  collects  load  statistics  from  all  processors,  computes  the  average  load, 
and  broadcasts  the  average  to  all  processors.  Each  processor  then  exchanges 
work  with  it’s  neighbors  so  that  the  new  workloads  equal  this  average.  Un¬ 
fortunately  this  simplest  method  is  not  scalable  because  it  is  inherently  serial. 
The  number  of  terms  in  the  calculation  of  the  average  load  is  proportional 
to  the  number  of  processors  in  the  computer  system.  Although  this  cost  can 
be  made  logarithmic  with  an  octree  data  structure  there  is  a  more  severe 
cost  associated  with  interprocessor  communication.  The  current  state  of  the 
art  in  mesh  routing  technology  requires  a  nonconflicting  communication  path 
for  each  message.  The  opportunities  for  path  conflicts  known  as  “blocking 
events”  increase  factorially  with  the  number  of  processors  in  the  computer 
system.  This  simplest  reliable  method  is  not  scalable  because  the  time  lost 
to  blocking  events  can  grow  factorially  with  the  size  of  the  computer  system. 

A  method  is  scalable  if  it  can  run  efficiently  on  computer  systems  with 
very  large  numbers  of  processors.  Due  to  the  effects  of  Amdahl’s  law  [1] 
most  scalable  methods  are  concurrent  algorithms  in  which  the  computation 
is  distributed  among  the  processors  in  the  computer  system.  What  these 
methods  gain  in  scalability  they  often  lose  in  reliability  because  they  lack 
formal  proofs  of  correctness  and  convergence. 

As  one  example  of  the  problems  which  can  arise  in  concurrent  algorithms 
consider  a  simple  concurrent  method  in  which  each  processor  adjusts  it’s  load 
to  equal  the  average  of  the  loads  at  it’s  immediate  neighbors.  This  method  is 
distributed  and  scalable  and  is  easily  seen  to  be  convergent.  Unfortunately  it 
is  well  known  that  it  converges  to  solutions  of  the  Laplace  equation  V2$  =  0. 
This  equation  is  known  to  admit  sinusoidal  solutions  which  are  not  equilibria. 
As  a  result  this  method,  although  scalable,  is  not  reliable. 

This  paper  leverages  existing  numerical  and  analytic  techniques  to  derive 
a  reliable  and  scalable  load  balancing  method.  The  method  is  based  on 
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properties  of  the  parabolic  heat  equation  ut  —  aV2u  =  0.  The  heat  equation 
describes  a  process  in  nature  whereby  thermal  energy  diffuses  away  from  hot 
regions  and  into  cold  regions  in  a  volume  until  the  entire  volume  is  of  the 
same  temperature.  A  literal  interpretation  of  the  terms  in  the  heat  equation 
reads  that  the  rate  of  change  in  temperature  u  at  each  point  in  the  volume 
is  determined  by  the  local  curvature  V2u  times  a  diffusion  rate  a.  This 
paper  applies  finite  difference  techniques  to  derive  an  unconditionally  stable 
discrete  form  of  this  equation,  and  uses  a  scalable  iterative  method  to  invert 
the  resulting  coefficient  matrix.  The  end  result  is  a  concurrent  load  balancing 
method  which  is  scalable,  reliable  and  efficient. 


3  A  Parabolic  Algorithm 

The  algorithm  consists  of  a  simple  arithmetic  iteration,  which  is  performed 
concurrently  by  every  processor  in  the  multicomputer.  Each  step  of  the 
iteration  requires  7  floating  point  operations  at  each  processor.  The  iteration 
calculates  an  expected  workload  at  each  processor.  Processors  periodically 
exchange  units  of  work  with  their  immediate  neighbors  in  order  to  make  their 
actual  workload  equal  to  this  expected  workload.  The  algorithm  contains 
parameters  which  control  the  accuracy  of  the  resulting  solution.  In  many 
applications  an  accuracy  of  10%  is  sufficient.  In  these  cases  only  24  iterations 
are  required  to  reduce  a  point  disturbance  by  90%  regardless  of  the  size  of  the 
multicomputer.  This  paper  presents  the  algorithm  for  a  three  dimensional 
mesh.  The  reduction  to  two  dimensions  is  described  in  the  discussion  section. 


3,1  Initialization 


Specify  the  accuracy  a  desired  in  the  resulting  equilibrium.  For  example,  to 
balance  to  within  10%  choose  <y  =  0.1.  Determine  the  interval  v  at  which 
processors  will  exchange  work  with  their  immediate  neighbors  according  to 
the  formula 


v  — 


In  ol 


In 


6a 

l+6a 


>  l 


(1) 


Note  that  in  the  interval  0  <  a  <  1  v  is  less  than  or  equal  to  3: 
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'  2  :  0  <  a  <  0.0445 
3:  0.0445  <  a  <  0.622 
2  :  0.622  <  a  <  0.833 
k  1  :  0.833  <  a 


3.2  Execution 

At  every  processor  x,  y,  z  adjust  the  workload  uXtViZ  as  follows: 


7#(®)  — 

ax,y,z  ax,y,z 

for  m=l  to  v 


u(o) 

Am)  _  x,y,z  , 

x’y'z  1  +  6a 

(m— 1) 


[  -  \  +  u (m"1)  + 

\1  +  6a/  \Ux+1<y'z  '  u*-i,y.z‘ 


(2) 


U 


x,y+l,z 


+  u^y-\,z  +  u 


(m— 1) 
x,y,z+ 1 


+  U 


(m-1)  \ 
Mi*- 1/ 


endf or 


Exchange  a  units  of  work  with  every  neighbor  w'. 


7/  —  7/(l/) 

“  ax,yyz 


Repeat  these  steps  until  reaching  equilibrium.  Much  of  the  following 
analysis  will  be  concerned  with  formulating  an  exact  statement  of  the  number 
of  repetitions  required  to  reach  equilibrium.  The  motivation  behind  this 
analysis  is  to  support  claims  of  correctness,  convergence,  and  accuracy.  The 
analysis  develops  a  theory  of  convergence  for  any  disturbance  and  applies 
this  to  the  specific  case  of  a  point  disturbance.  Analysis  appears  to  be  less 
practical  in  many  cases  than  conservative  estimates  derived  from  simulations. 
Following  the  analysis  this  paper  presents  simulations  from  cases  of  interest 
in  CFD  and  distributed  operating  systems.  The  paper  concludes  with  a 
summary  and  discussion. 

4  Reliability  and  Scalability 

This  section  demonstrates  reliability  of  the  method  by  showing  that  for  any 
initial  disturbance  every  component  of  the  disturbance  vanishes  at  an  expo- 
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nential  rate.  It  demonstrates  scalabilty  by  defining  a  lower  bound  on  this  rate 
as  a  function  of  O',  n.  It  applies  the  resulting  theory  to  the  case  of  a  point  dis¬ 
turbance  and  demonstrates  that  weakly  superlinear  speedup  can  be  achieved 
under  realistic  assumptions.  The  appendix  contains  technical  derivations  in 
support  of  this  analysis. 

ACCURACY  OF  THE  JACOBI  ITERATION 


Since  the  algorithm  is  intended  to  observe  strict  accuracy  of  0(a)  it  is  im¬ 
portant  to  verify  that  each  stage  of  the  algorithm  observes  this  accuracy. 
The  justification  of  accuracy  for  the  finite  difference  equation  is  given  in  the 
appendix.  The  accuracy  of  the  coefficient  matrix  inversion  can  be  verified  by 
analyzing  the  spectral  radius  of  the  Jacobi  iteration. 

From  the  Gersgorin  disc  theorem  [7]  the  eigenvalues  A  of  (2)  are  bounded 
|A|  <  ^4^-  Since  the  row  and  column  sums  are  constant  and  the  iteration 
matrix  is  nonnegative  ([7],  theorem  8.1.22)  the  spectral  radius  equals  the  row 
sum 


(fl-T)  =  - 


6  a 


+  6a 


(3) 


Define  the  error  in  a  current  value  u(m^  under  the  iteration  (2)  as  e(w^)  = 
(w(m)  —  w*)  where  u*  is  the  fixed  point  of  the  Jacobi  iteration.  Then  for  v  >  0 


e(uM)  =  e((D^TYu^)  =  (D^T)ye(yf^)  (4) 


which  converges  when  p(D~1T)  <  1  since  /?((£>~1r)I/)  =.  ( p(D~1T ))l/.  In 
order  for  the  algorithm  to  correctly  reduce  the  error  it  is  necessary  to  compute 
the  desired  load  at  each  time  step  to  an  appropriate  accuracy.  In  order  to 
quantify  the  error  define  the  infinity  norm 


||e(^7n^)|jco  =  max  e(?4m)  il  — 

x,y,z  v  x 


x,y,z)\ 


—  max  |  u 

x,y,z 


(m) 

x,y,z 


U, 


x,y,z\ 


Using  this  norm  define  a  necessary  condition  to  improve  the  accuracy  of  the 
solution  u  by  a  factor  a  in  v  steps  to  be  ||e(iZM)||00  <  a||e(u^)||00.  From  (4) 
this  is  satisfied  when  ( p(D~lT))u  <  a  and  thus  (1) 

In  a 

V=  In  ~p[Dz^T) 

The  method  is  unconditionally  stable  and  the  cost  of  this  stability  is  small 
(in  fact  it  is  free  for  0.833  <  a  <  1).  This  suggests  the  possible  use  of  large 
time  steps  to  deal  with  worst  case  disturbances. 


In  ct 


In 


6q 

l-i-Ga 


(5) 
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ELAPSED  TIME  FOR  THE  DIFFUSION 

This  section  determines  the  number  of  artificial  time  steps  r  required  to 
reduce  the  load  imbalance  by  a  factor  a.  It  does  this  by  considering  the 
eigenstructure  of  the  finite  difference  equation  (22)  which  is  rearranged  to 
express  the  change  in  load  with  each  iteration 

'U'x,y,z{t  T  dt)  —  &  [wa7+l T  dt)  T  +  dt) 

~\“'U'x,y+l,z(t  T  dt)  T  ^x,y— l,z(t  “1“  dt) 
“h^r,y,z+i T  dt)  T  uXlytz—i(t  T  dt) 

6wXiy^(t  +  dt)] 

or  as  a  vector  equation  with  matrix  operator  L 


u(t  +  dt)  —  u(t)  =  aLu(t  +  dt)  (6) 

Any  load  distribution  u(t)  can  be  written  as  a  weighted  superposition  of 
eigenvectors  x  of  L 

WC0  = 

hj,k 

Using  this  fact  rewrite  the  vector  equation  (6)  as 

T  dt)x{tj9k  yi  jjk  =  Lat*jtfc(t  +  dt)x{j ^  (7) 

hjyk  ij,k 

Using  the  definition  of  and  the  eigenvalues  of  L 

X ijtk  =  2  [^3  —  cos  ^2 Tri/n —  cos  feirj/n1/3)  —  cos  (27 rfc/n1/3)]  (8) 

(7)  can  be  further  simplified  to 

T  dtjx^j^  [1  +  aXij'k]  ~~  ahjfk{t)xi,j,k)  ~  0 

hj,k 

and  by  the  completeness  and  orthonormality  of  the  eigenvectors 


T  dt)  [1  T  a\ijik]  ^tj,fc(^)  —  0 

1  +  aXijfk 


(9) 
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It  is  apparent  from  equation  (9)  that  convergence  of  the  individual  com¬ 
ponents  is  dependent  upon  the  eigenvalues  A^>.  Reducing  the  amplitude 
of  an  arbitrary  component  i,j,k  by  a  in  r  steps  of  the  method  requires 
[1  +  aX ijfi]  T  <  oi.  The  worst  case  occurs  for  the  smallest  positive  eigen¬ 
value  AqjO,!  =  (2  —  2  cos(27r/n1/3))  which  corresponds  to  a  smooth  sinusoidal 
disturbance  with  period  equal  to  the  length  of  the  computational  grid.  To 
reduce  such  a  disturbance  requires 


r  = 


In 


a 


(10) 


I  In  [l  +  «  (2  —  2 cos  ^73)] 

Convergence  of  this  slowest  component  approaches  In  a-1  for  large  n  since 


lim  In 

n— >oo 


1  +  a  2  —  2  cos 


2tt  y 
n1/3/. 


1 


Convergence  of  highest  wavenumber  component  A(ni/3)/2_lj(ni/3)/2„1(7ii/3^2_1 
is  rapid  because 

In  aT1 


r  = 


In  [1  T  (6  — 


(ii) 


ANALYSIS  OF  A  POINT  DISTURBANCE 

This  section  considers  a  specific  case  of  a  point  disturbance  and  derives  an 
inequality  relating  r,  n  and  a.  The  purpose  in  doing  this  is  to  provide  an  exact 
prediction  of  convergence  of  a  known  case  in  order  to  demonstrate  scaling 
properties.  The  procedure  followed  is  to  describe  the  initial  amplitudes  of  the 
components  of  the  disturbance  and  then  solve  an  inequality  which  describes 
the  magnitude  of  the  disturbance  over  time.  A  periodic  domain  is  assumed 
for  the  purpose  of  analysis.  Simulations  presented  later  in  this  paper  verify 
that  convergence  is  similar  on  aperiodic  domains. 

The  following  text  uses  the  Poisson  bracket  {•,  •)  to  represent  the  inner 
product  operator.  When  discussing  loads  or  eigenvectors  it  uses  u[x,y,z] 
or  Xijtk[x,y,z]  to  denote  the  vector  element  which  corresponds  to  location 
x,  y,  z  of  the  computational  grid  with  the  convention  that  [0, 0, 0]  is  the  first 
element  of  the  vector.  Then  the  initial  disturbance  confined  to  a  particular 
processor  x,  y,z  can  be  written  as  a  superposition  of  eigenvectors  of  L 


“(0)  =  X)  0)xi)7, 

f,m,n 


(12) 
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Assume  the  initial  disturbance  u( 0)  to  be  zero  at  every  element  except 
[x9y,z].  Then 

^(0))  —  xi}j1k[x^y^]  (13j 

This  is  equal  to  the  initial  amplitude  a*j,fc(0)  of  each  eigenvector  Xij^ 


w(0))  — 


y!  ( xl,m}n)  0) 

/,m,n 


/,m,n 

°*»i**(0) 


(14) 


The  computational  domain  has  periodic  boundary  conditions  and  as  a  result 
the  origin  of  the  coordinate  system  is  arbitrary.  Without  loss  of  generality 
place  the  origin  at  the  source  of  the  disturbance  and  take  x  =  y  =  z  =  0. 
This  has  no  effect  on  the  eigenvectors  x^jg  and  from  (13),  (14) 


ai,j,k{ 0)  —  0,  0]  (15) 

Placing  the  origin  at  the  source  of  the  disturbance  is  particularly  convenient 
when  we  consider  the  first  element  of  the  eigenvectors  0, 0].  L  has 

/2  distinct  eigenvalues  A each  of  algebraic  multiplicity  two.  Each 
of  these  eigenvalues  has  geometric  multiplicity  eight,  ie.  has  eight  linearly 
independent  associated  eigenvectors  of  unit  length 

Xi,j,k[xi  Vi  *3  =  ci,j,kFi  (27 rxi/n1/3'j  F2  ( 2i:yj/nlf 3)  F3  (2xzAr/n1/3)  (16) 

where  each  j Pt  is  either  sin  or  cos.  By  choosing  x  =  y  —  z  =  0  this  expression 
(16)  is  zero  except  for  the  single  eigenvector  for  which  F\(x )  =  F2(x)  = 
F3(x)  —  cos(a;).  Without  loss  of  generality  restrict  further  consideration  to 
initial  disturbances  of  the  form 


«[0, 0, 0] (0)  =  £  0, 0, 0]  =  £  (17) 

h3,k  ij,k 

From  (9)  define  the  time  dependent  disturbance  at  any  location  x',  ?/,  z1 

u[x',  y',  z]  ( rdt )  =  ]T  cithk  [1  +  a\tihk]~T  xitjtk[x',  y',  z] 
i,j,k 
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r(a,  n)  n  (total  processors) 


64 

512 

4,096 

8,000 

32K 

256K 

106 

0.1 

7 

6 

a 

\j 

5 

5 

5 

5 

a  0.01 

152 

213 

229 

173 

157 

145 

141 

0.001 

2,749 

5,763 

10,031 

10,139 

9,082 

7,564 

7,003 

Table  1:  Solutions  to  equation  (20)  for  increasing  processor  count  n  and 
accuracy  a.  r  represents  the  number  of  exchange  steps  in  the  method.  See 
figure  1. 


=  ]C  ch,k  t1  +  ^  cos  fax'i/n1^ 

COS  (2ny'j/n1/3)  COS  ^2 7TZf  kin1'3)  (18) 

The  appendix  demonstrates  that  Cij>k  =  (8/ra)1/2  and  thus  the  disturbance  is 
a  summation  of  equally  weighted  eigenvectors.  From  (17)  and  (18)  the  time 
dependent  disturbance  at  0, 0, 0  is  therefore 


u[0,0,0](rdt)  ~ 


n  i,j,k 


i  ,  cio  27 ri  27 vj  2xk 

1  +  o2  |  3  -  cos  p  cos  -/3  cos 


Solving  u[07  0, 0](rdt)  <  a  yields 


-E 

n  —U  \ 

i,3,k  L 


.  2  iri  27 rj  2irk 

1  +  «2  (  3  -  cos  -  cos  -  cos 


<  a 


—r 


(19) 

(20) 


where  i,j,  k  are  indexed  from  0  to  (n1/3)  /2  —  1  and  the  case  i  =  j  =  k  =  0 
is  omitted. 

Table  1  and  figure  1  present  solutions  of  the  inequality  (20).  These  are 
exact  predictions  of  the  number  r  of  exchange  steps  which  must  occur  to 
reduce  a  point  disturbance  by  a  factor  a  on  a  periodic  domain.  Figures  2 
and  3  are  simulated  results  for  two  CFD  cases.  The  first  case  of  partitioning 
an  unstructured  grid  is  a  point  disturbance.  In  the  simulation  r  is  observed  to 
match  the  theoretical  prediction  exactly.  The  second  case  of  rebalancing  after 
a  grid  adaptation  demonstrates  the  value  of  estimating  r  from  simulations. 
The  initial  disturbance  is  not  a  point  and  the  simulation  is  observed  to  require 
170  exchange  steps  before  the  worst  case  discrepancy  drops  to  10%  of  it’s 
original  value. 
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Number  of  processors  (nj 

Figure  1:  Scaled  number  of  exchange  steps  ra  to  achieve  accuracy  or  for  var¬ 
ious  n,  a  following  a  point  disturbance.  See  table  1  and  equation  (20).  Each 
line  is  scaled  by  or.  All  lines  are  initially  increasing  for  small  n  and  asymp¬ 
totically  decreasing  for  larger  n  demonstrating  weak  superlinear  speedup. 

Partition  1,000,000  point  unstructured  grid  on  512  processors  Rebalance  10*6  processors  after  loot  Increase  in  grid  density 
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Figure  2:  Time  course  of  disturbances  for  simulated  CFD  cases.  Left:  largest 
discrepancy  among  512  processors  partitioning  an  unstructured  computa¬ 
tional  grid.  The  initial  disturbance  of  1,000,000  points  confined  to  a  single 
processor  is  reduced  by  90%  after  6  exchanges  (20.625  /is)  in  exact  agree¬ 
ment  with  theory  (  r(0. 1,512)  in  table  1).  Right:  largest  discrepancy  among 
1,000,000  processors  rebalancing  a  disturbance  following  a  bow  shock  adap¬ 
tation.  a  =  0.1,  v  —  3  in  both  cases.  Wall  clock  times  assume  a  32  MHz 
J-machine.  The  interval  between  exchange  steps  is  3.4375  (is. 
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Simulations 


This  section  presents  simulations  of  three  cases  in  which  the  method  is  use¬ 
ful.  The  first  case  is  a  bow  shock  resulting  from  a  million  grid  point  CFD 
calculation  [23] .  The  disturbance  dissipates  rapidly  and  is  nearly  gone  after 
10  exchange  steps.  The  second  case  computes  an  initial  load  distribution 
for  a  million  point  unstructured  grid  problem  on  a  512  node  multicomputer. 
The  simulation  suggests  the  method  may  be  highly  competitive  with  Lanc- 
zos  based  approaches  presented  recently  in  [3,  20].  The  third  case  simulates 
a  multicomputer  operating  system  under  conditions  of  random  load  injec¬ 
tion.  This  case  demonstrates  that  the  method  can  effectively  balance  large 
disturbances  which  occur  frequently  and  randomly. 

Parameters  for  the  simulations  are  based  on  two  scalable  multicomputers. 
The  first  is  a  512  node  J-machine  [19]  which  is  in  use  at  Caltech  for  research 
in  CFD  and  scalable  concurrent  computing.  The  second  is  a  hypothetical 
1,000,000  node  J-machine.  These  two  design  points  represent  a  continuum 
of  scalable  multicomputers.  The  method  is  practical  at  both  ends  of  this 
continuum  and  presumably  at  all  points  in  between. 

Wall  clock  times  are  based  on  a  hand  coded  implementation  of  the  method 
in  J-machine  assembler  and  assumes  32  MHz  processors.  Each  repetition  of 
the  method  requires  110  instruction  cycles  in  3.4375  fis.  All  simulations  are 
run  with  a  =  0.1  and  v  =  3  resulting  in  3  iterations  between  each  exchange 
of  work. 


5.1  Bow  Shock  Adaptation 

In  CFD  calculations  it  is  common  to  adapt  a  computational  grid  in  response 
to  properties  of  a  developing  solution.  This  simulation  considers  an  adapta¬ 
tion  that  results  from  a  bow  shock  in  front  of  a  Titan  IV  launch  vehicle  with 
two  boosters  (figure  3).  The  grid  has  been  adapted  by  doubling  the  density 
of  points  in  each  area  of  the  bow  shock.  As  a  result  the  initial  disturbance 
shows  locations  in  the  multicomputer  where  the  workload  has  increased  by 
100%  due  to  the  introduction  of  new  points.  After  10  exchanges  of  work 
the  imbalance  has  already  decreased  dramatically.  After  70  exchanges  the 
disturbance  is  scarcely  visible.  This  simulation  assumes  a  1,000,000  proces¬ 
sor  J-machine.  This  example  illustrates  the  weak  persistence  of  low  spatial 
frequencies. 
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Figure  A:  Disturbance  following  a  bow  shu<  k  adaptation  uf  a  computational 
firii  1  on  a  million  processor  J  machine.  First  frame  h  the  initial  disturbance 

n ■- 1 ill.Liit',  fruni  f ho  adaptfl.ti .  Siihfl^qii^nt  f r*i r ■  i r .nr  ratwl  hy  IN  rv- 

rhange  steps*  The  disturbance  is  reduced  dramatically  by  the  second  frame. 
After  71)  ex«  hange  steps  only  weak  low  frequency  componenls  remain. 


L3 


I  igure  4:  Disturbance  during  an  initial  load  of  d  million  point  unstructured 
computational  grid  onto  a.  5J2  processor  J-macMn®.  1  hr  first  frame  reppe- 

-ents  1 1 j ■  ■  entitvgrid  assigned  tu  a  hosl  : kxL-  on  t  Ik'  i 1 1 u I tieompiit ■  t.  ]  hi-  is  h 

point  disturbance  and  the  resulting  behavior  is  in  exact  agreement  with  the 
■'analysis  presented  earlier  in  i  his  paper.  Subset urnt  frames  are  separated  by 
Jn  exchange  slep>.  After  Til  exchange  siops  i  hr  v-  orkluail  is  alreiulv  rough K 
balanced.  A  balance  within  I  grid  point  was  achieved  after  5UU  exchange 
steps:. 

5.2  Partitioning  an  Unstructured  Grid 

In  parol  Id  (TD  applications  die  static  load  balancing  problem  has  hern 
i he  subject  of  recent  attention  [3,  20].  In  addition  to  finding  an  equitable 

distribution  nl  U  or  L  I  Iih  |sn  iblcri.  imisl  nhwerVC  1  hi-  ;ul  flit  uirial  cm msl  rtf i 111 
of  preserving  adjacency  relations  hips  among  elements  of  an  unstructured 
cornpui at ional  grid.  The  bad  balancing  method  presented  in  i hit^  paper  can 
satisfy  the  adjatrt'my  t  wilsE  r .±i til  if  a\  each  exchange  *1-ep  ir  stdect^  exchange 
candidates  that  observe  the  constraint. 

Hus  simulation  model*  an  init  ial  point  disturbance  by  assigning  I  JJOUJJlXl 
point s  of  an  utisUrn cl  tired  coftipiiiatk>n&l  grid  to  a  single  host  node  of  a  512 
processus  ,1-machine  (sbi1  figure  1).  [l  exec  nit**  the  algorithm  while  ubsen  - 
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!  itniiY-  -5;  Disturbances  requiring  from  rapid  injection  of  large  random  loads 
on  ei  million  processor  J-niarhifle,  After  each  exchange  step  a  point  distur¬ 
bance  introduced  at  a  randomly  chosen  processor.  Hie  average  value  of 
ear  1 1  poitii  disturbance'  is  30*000  limes  the  initial  system  load  average.  The 
interval  between  successive  frames  is  IUU  exchange  steps,  Afler  7U(J  injections 
i  lie  wuis i  rase  discrepancy  was  Lu.7^7  1  hues  l-Iif p  iunial  load  average.  Hus 
demonstrates  t lit*  algorithm  was  balancing  the  load  faster  than  disturbances 
were  created.  After  load  injection  ceased  an  additional  100  repetitions  with 
iui  new  disturbance  reduced  1  lie  worst  case  discrepancy  from  15,737  to  50 
\  imes  the  in  dial  load  average. 


big  the  adjacency  conistmtri  ar  each  exchange  step,  Ihe  initial  disturbance 
weis  reduced  by  IHI'.f  after  fi  exchange  steps  in  exact  ag«rmnit  with  th«or> 
(rl  0.1,512]  in  table  1).  After  50  exchange  steps  the  worst  case  discrepancy 
was  099  grid  points.  After  162  steps  the  worst  case  discrepancy  was  200  grid 
points,  HJ'K  of  the  load  average,  A  balance  wiiliin  ]  grid  poini  was  achieved 
on  r  he  5001  h  step. 


o*3  Rautlmii  Load  Iiij^cliuu 
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To  be  useful  in  practical  contexts  the  method  must  be  able  to  rebalance  dis¬ 
turbances  faster  than  they  arise.  This  simulation  (figure  5)  demonstrates  the 
behavior  of  the  method  on  a  1,000,000  processor  J-machine  under  demand¬ 
ing  conditions.  An  initially  balanced  distribution  is  disrupted  repeatedly 
by  large  injections  of  work  at  random  locations.  Injection  magnitudes  are 
uniformly  distributed  between  0  and  60,000  times  the  initial  load  average. 
The  simulation  alternates  repetitions  of  the  algorithm  with  injections  at  ran¬ 
domly  chosen  locations.  After  700  repetitions  and  injections  the  worst  case 
discrepancy  was  15,737  times  the  initial  load  average.  This  is  less  than  the 
average  injection  magnitude  of  30,000  at  each  repetition.  This  demonstrates 
the  method  was  balancing  the  load  faster  than  the  injections  were  disrupting 
it.  The  last  random  injection  occured  on  step  700.  After  100  additional  ex¬ 
change  steps  without  intervening  injections  the  worst  case  discrepancy  had 
reduced  from  15,737  to  50  times  the  initial  load  average. 


6  Summary  and  Discussion 

This  paper  has  demonstrated  that  a  diffusion  based  load  balancing  method 
is  efficient,  reliable,  and  scalable.  It  has  shown  through  rigorous  theory  and 
empirical  simulation  that  the  method  is  inexpensive  for  important  generic 
problems  in  CFD.  This  section  discusses  a  few  remaining  points  and  indicates 
future  directions  for  this  research. 

An  important  property  of  the  method  is  it’s  ability  to  preserve  adjacency 
relationships  among  elements  of  a  computational  domain.  Preserving  adja¬ 
cency  permits  CFD  calculations  to  minimize  their  communication  costs.  The 
method  preserves  adjacency  if  it  does  so  at  each  exchange  step. 

To  make  this  requirement  concrete  consider  an  unstructured  computa¬ 
tional  grid  for  a  CFD  calculation.  When  the  time  comes  for  the  load  balanc¬ 
ing  method  to  select  grid  points  to  exchange  with  neighboring  processors  it 
selects  points  in  such  a  way  that  average  pairwise  distance  among  all  points 
is  minimal.  One  way  to  do  this  is  to  assume  that  each  processor  represents 
a  volume  of  the  computational  domain  and  to  select  for  exchange  those  grid 
points  which  occupy  the  exterior  of  the  volume.  The  selected  points  would 
transfer  to  adjacent  volumes  where  their  neighbors  in  the  computational  grid 
already  reside.  In  order  for  this  to  be  practical  it  must  be  inexpensive  to  iden¬ 
tify  these  exterior  points.  In  problems  where  the  computational  grid  has  been 
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generated  contiguously  this  is  presumably  a  simple  matter  because  adjacent 
points  will  have  data  structures  that  identify  their  neighbors.  In  more  general 
problems  where  the  data  is  not  already  ordered  the  use  of  priority  queues 
appears  promising  due  to  their  O(nlogn)  complexity. 

It  is  worth  noting  that  the  method  can  be  used  to  rebalance  a  local  portion 
of  a  computational  domain  without  interrupting  the  computation  which  is 
occurring  on  the  rest  of  the  domain.  This  can  be  useful  in  CFD  problems 
where  some  portions  of  the  domain  converge  more  quickly  than  others  and 
adaptation  might  occur  locally  and  frequently. 

This  paper  presented  the  method  and  analysis  for  a  domain  which  has 
periodic  boundary  conditions  and  is  logically  spherical.  In  practice  multi¬ 
computer  meshes  are  rarely  periodic.  In  our  simulations  we  implemented 
aperiodic  boundaries  by  imposing  the  Neumann  condition  du/dx  —  0  in 
each  direction  x ,  y,  2.  This  requires  a  simple  modification  to  the  iteration  (2) 
so  that  processors  immediately  outside  the  mesh  appear  to  have  the  same 
workload  as  processors  one  step  inside  the  mesh.  For  example,  if  the  proces¬ 
sors  are  indexed  from  1  through  n  in  the  x  dimension  then  Uo,y>z  =  u2ly}Z  and 


^n+1,2/,2  —  l,y,Z' 

The  algorithm  is  presented  for  three  dimensional  scalable  multicomputers. 
It  reduces  for  two  dimensional  cases  by  redefining  v  and  the  iteration  (2)  as 


follows: 


7  /  - 

In  a 

1/  - 

In  4a 

111  1+4  a 

tt<°> 

x,y  .  .  | 

1  +  4a 
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_L_  7/(m  1) 

^  Ux-l,y 
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The  worst  case  behavior  of  the  method  is  determined  by  disturbances  of 
low  spatial  frequency.  This  is  the  basis  of  the  objections  in  [11]  and  a  con¬ 
ventional  response  would  be  to  apply  a  multigrid  method  [11,  14].  Such  a 
method  can  have  logarithmic  scaling  in  n  and  0(ri)  convergence.  The  anal¬ 
ysis  presented  in  this  paper  demonstrates  that  wall  clock  times  can  actually 
decrease  as  n  increases  and  this  suggests  considering  other  methods  of  treat¬ 
ing  the  worst  case  disturbance.  One  such  method  would  be  to  use  very  large 
time  steps  in  order  to  accelerate  convergence  of  the  low  frequency  compo¬ 
nents.  The  unconditional  stability  of  this  method  makes  this  an  attractive 
option.  Although  this  would  increase  the  error  in  the  high  frequency  compo¬ 
nents  these  components  can  be  quickly  corrected  by  local  iterations.  We  are 
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presently  considering  the  costs  associated  with  such  iterations. 


7  Appendices 


This  appendix  presents  technical  justifications  for  statements  relied  upon  in 
the  analysis. 


FINITE  DIFFERENCE  FORMULATION  OF  THE  HEAT  EQUATION 
Consider  the  parabolic  heat  equation  in  three  dimensions 

Uf  %l  —  Zlxx  "h  Uj ly  T  Uxx 

Taylor  expanding  in  t  with  all  derivatives  evaluated  at  ( x,y,z,t ) 
u(x,y,z,t  +  dt)  =  u{x,y,  z,t)  +  utdt  +  0(dt2) 


(21) 


ut  - 


+  dt) 

dt 


+  0  (dt) 


Obtain  the  second  order  terms  by  expanding  in  spatial  variables  where  omit¬ 
ted  coordinates  are  interpreted  as  (x,  y,  z.  1) 

u(x  +  dx,  =  u{x, +  uxdx  + 

dx2  dx3  4 

U'xx  ~  f"  Vjxxx  ~Z  f"  0(dx  i 
l  0 

u(x  dx .  ~,  *j  u^x , m,  m,  *  j  uxdx  -f- 

dx2  dx3 


'U'XX  - 


2  ^xxx  g  “I"  0  (dx  ) 

u(x +  dx, + u(x  -  dx,  =  2u(x,-,-,-)  +  uxxdx2 +  0(dx4) 
'u(x  +  dx, +  u(x  -  dx, -2u(x,-,-,-)\  2x 

d & - j  +0(-dx  > 

Similar  expansions  in  y,z  show  that  the  heat  equation  can  be  rewritten 

u(-,-,-,t  +  dt)  —  _  ( u(x  +  dx,  •,  •,  •)  +  u(x  -  dx,  •)  -  2 u(-,  •,  •) 

dx2 


dt 

+ 


'vXjjj  ±  dV>  '2  •)  +  y  z  il '}  ~  2  2 ') 

,  dy 2 
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+ 


u(-, -,z  +  dz,  •)  +  «(•,  -,z-  dz,  •)  -  2 u(-,  -,  ■,  •) 
dz 2 


+  0(dt,  dx 2 ,  dy2,  dz2) 


From  (21)  the  finite  difference  formulation  is 


i{x,y,z,t  +  dt)  -u(x,y,z,t)  1 

- dt - =  dx 2  +  dx' V +  ~<*x,y, *)+ 


u(#,  2/  +  dy,  zi  t)  +  «(®?  2/  —  dy,  z,  t )  +  u(x,  y,z  +  dz,  i)+ 
u(x,  y,z  —  dz,  t )  —  6u(a;,  y,  z,  t )) 

Setting  ot  =  and  taking  the  spatial  terms  on  the  right  at  time  yields 

an  unconditionally  stable  implicit  scheme 


u(x,  y,  z,  t)  —  (1  +  6a)u(a;,  y,  z,  t  +  dt)  —  a  [u(x  +  dx,  y,z,t  +  dt)+  (22) 

u(x  -  dx,  y,z,t  +  dt)  +  u(x,  y  +  dy,  z ,  t  +  dt)  +  u(x,  y-dy,z,t  +  dt)+ 

u(x,  y,z  +  dz,t  +  dt)  +  u(x,  y,z  —  dz,t  +  dt)] 

JACOBI  ITERATION  TO  COMPUTE  A^u 

In  order  to  compute  solutions  at  successive  time  intervals  dt  we  must 
invert  the  relationship  =  Au^t+dt^  by  solving 

u(t+dt)  _  A-i u(t)  (23) 

From  (22)  it  is  apparent  that  A  has  diagonal  terms  (1  +  6  a)  and  six  off  diago¬ 
nals  a.  Let  A  =  ( D  —  T )  where  D  is  this  diagonal.  Then  (D  —  T)u^t+dt^  = 
implies  u^t+dt^  —  -D-1Ttd<+rf<)  +  D~lvAA  This  relation  is  satisfied  by  fixed 
points  of  the  Jacobi  iteration 

[w(t+<fc)] =  D-iT  [u(‘+^)] (m-1}  +  £,-iuW  (24) 

The  matrix  D~lT  has  a  zero  diagonal  and  six  offdiagonal  terms £)_1 
is  a  diagonal  matrix  with  terms  The  iteration  (24)  is  the  central  loop 

of  the  method  (2).  Note  that  this  iteration  is  everywhere  convergent  with 
spectral  radius  defined  by  (3). 

UNIT  IMPULSE  DERIVATION 
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This  section  demonstrates  that  the  eigenvector  normalization  constant 
is  equal  to  (8/ra)1/2  for  all  eigenvectors  Xij,k.  From  (16)  a  necessary  condition 
for  a  normalized  eigenvector  is 


1  =  ch,k  £  cos2  (27r^7i)  cos2  (27r^)  cos2  (2r 


x,y,z 

1 


zk 

n1/3 


=  C-fc8  £  (X  +  C°S  i1  +  C°S  ^ )  i1  +  C°S  4" 

yj 


*,y,z 


=  S((1  +  cos4f^)[1  +  cos4ff 


n1/3 


V73 


nl/3 


+ E  (4jrJ;)  E  “s  (4»  Js)  (4  +  c“ 4^ 

Simplify  the  preceding  expression  by  the  following 

Lemma  1 


zk 

TTs 


Ecos(4^)=0 


Proof: 


Ec“(4^)  =  e& («•“;) 

v .  *  4irxi 

=  Re  £  e 

=  sex;  («■»)' 


=  /?e 


=  0 


e «1'3  1 


-  (e^)"l/3 


1  -  e  n1/3 


(25) 


Q.E.D. 
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Repeated  application  of  lemma  (1)  to  equation  (25)  yields 

zk 


1  =  c?'4,?J(1+cos4,rJ;)(1+cos4,r 


n 


1/3 


g 


Chhk] 


E  (i  +  ~  *&)  +  E  (« *&)  (‘ + ■ 


1  +  X)  (  COS  47T 

L  x,y,z  x,y,z 


ni/3 


=  Ch,kcn 


;8 


zk 

i?3 


(26) 


From  which  we  conclude 


ci,j,k  — 


v  i,j,k 
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